Coding for DS and DM
R coding module

Lecture 6

Andrea Cappozzo
andrea.cappozzo@unimi.it
AndreaCappozzo
andreacappozzo.rbind.io

Tibbles - Definition

A tibble is a new way to store data in R in a tabular format. There are some slight (but important) differences between tibbles and data frames:

  1. Printing: For tibbles, only the first 10 rows and as many columns as fit on the screen are printed. The type of each variable is also displayed.
  2. Subsetting: Tibbles do not perform partial matching. Additionally, subsetting a tibble with [ always returns a tibble.
  3. Creation: In tibbles, character variables are never converted to factors (this is also the default for data.frame() since R 4.0). Also, tibbles do not have row names.
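
The third point is not illustrated in the slides that follow, so here is a minimal sketch of it. The toy data (name, value, r1, r2, id) are made up for illustration; tibble(), as_tibble() and has_rownames() come from the tibble package.

```r
library(tibble)

# tibble() keeps strings as character vectors
tbl <- tibble(name = c("a", "b"), value = 1:2)
is.character(tbl$name)   # TRUE

# data.frame() converts them only if asked (this was the default before R 4.0)
df <- data.frame(name = c("a", "b"), value = 1:2,
                 stringsAsFactors = TRUE, row.names = c("r1", "r2"))
is.factor(df$name)       # TRUE

# Row names are dropped on conversion, unless stored in a regular column
tbl_no_names <- as_tibble(df)
tbl_with_id  <- as_tibble(df, rownames = "id")
has_rownames(tbl_no_names)   # FALSE
names(tbl_with_id)           # "id" "name" "value"
```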

Tibbles vs data frames

head(iris,3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
iris_tbl <- tibble::as_tibble(iris)
iris_tbl
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

Partial matching

head(iris$Spe,3)
[1] setosa setosa setosa
Levels: setosa versicolor virginica
iris_tbl$Spe
Warning: Unknown or uninitialised column: `Spe`.
NULL

Subsetting

head(iris[,1])
[1] 5.1 4.9 4.7 4.6 5.0 5.4
iris_tbl[,1]
# A tibble: 150 × 1
   Sepal.Length
          <dbl>
 1          5.1
 2          4.9
 3          4.7
 4          4.6
 5          5  
 6          5.4
 7          4.6
 8          5  
 9          4.4
10          4.9
# ℹ 140 more rows
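
If a plain vector is needed instead of a one-column tibble, one possible approach (a sketch, not the only way) is to extract the column with [[ or $:

```r
library(tibble)

iris_tbl <- as_tibble(iris)

one_col_tbl <- iris_tbl[, 1]             # still a tibble with one column
vec_by_pos  <- iris_tbl[[1]]             # numeric vector, selected by position
vec_by_name <- iris_tbl$Sepal.Length     # numeric vector, full name required

class(one_col_tbl)
head(vec_by_pos)
```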

Tibbles

  • If you use readr to import external data, remember that the result will always be a tibble.
  • In some cases (especially with very old and outdated functions), you may encounter errors if you pass data in the form of a tibble instead of a data frame.
  • To convert a tibble into a data frame, use the function as.data.frame().
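
A short sketch of the conversion in both directions, reusing the iris example from the previous slides (the object names iris_df and iris_tbl are only illustrative):

```r
library(tibble)

iris_tbl <- as_tibble(iris)            # data frame -> tibble
iris_df  <- as.data.frame(iris_tbl)    # tibble -> plain data frame

class(iris_tbl)   # "tbl_df" "tbl" "data.frame"
class(iris_df)    # "data.frame"

# Back to base R behaviour: a single column is simplified to a vector
head(iris_df[, 1])
```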

ggplot2 - Philosophy

  • The simple graph has brought more information to the data analyst’s mind than any other device.
    (John Tukey)
  • You can also find a more detailed explanation of how this package works by consulting the book ggplot2: Elegant Graphics for Data Analysis.

ggplot2 - Introduction

  • Let’s start by loading the package: library(ggplot2)
  • In the next examples, we will use the mpg dataset. Let’s load it into memory and read the associated help page.
library(ggplot2)
data(mpg)
?mpg  # open the help page of the dataset
# tibble::glimpse(mpg)

ggplot2 - The First Plot

  • Now, let’s create our first plot:

    ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy))

  • What relationship exists between hwy (highway miles per gallon) and displ (engine displacement, in litres)?

ggplot2

  • The first function we used, ggplot, creates an empty plot.
  • This empty plot can then have various layers (defined by the user) added to modify it. For example, in the previous plot, the geom_point function adds a layer to the empty plot representing the scatterplot of hwy vs displ.
  • To define how a layer is created (and how the variables in the dataset are mapped to the plot), we use the aes function, within which we specify which values to map to the x-axis and the y-axis.
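
The idea of an empty plot that receives layers can be checked directly. The sketch below assumes that the layers are reachable through the layers element of the plot object, which is an internal detail of ggplot2 rather than something you normally need to touch.

```r
library(ggplot2)

p_empty  <- ggplot(data = mpg)                        # data attached, no layers yet
p_points <- p_empty +
  geom_point(mapping = aes(x = displ, y = hwy))       # one layer added

length(p_empty$layers)    # 0
length(p_points$layers)   # 1
```

Typing p_empty at the console draws a blank panel, while p_points draws the scatterplot.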

ggplot2

  • In general, a plot in ggplot2 is created using the following command:

    ggplot(data = DATA) +
      GEOM_FUNCTION(mapping = aes(MAPPINGS))
  • GEOM_FUNCTION is a function that creates a layer, and MAPPINGS are the aesthetic mappings (pairs of the form aesthetic = variable) that we pass to aes(). As the analysis becomes more complex, we will continue to add more layers.
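
As an illustration of the template, here is one possible way to fill it in with a different geometry; the choice of geom_boxplot and of the class and hwy variables is just an example.

```r
library(ggplot2)

# DATA = mpg, GEOM_FUNCTION = geom_boxplot, MAPPINGS = x = class, y = hwy
p_box <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

# layer_data() returns the values computed for the layer: one row per box
nrow(layer_data(p_box))
```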

ggplot2 - Aesthetic Mappings

  • Given a particular GEOM_FUNCTION (like geom_point), there are various types of aesthetics that can be modified to customize many aspects of each layer. A complete list is available by reading the help associated with each function.
  • For example, let’s look at the help for the geom_point function. We see that the possible aesthetics are not just x and y (the coordinates of the point), but also colour, size, and shape.
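
Besides the help page, the aesthetics of a geometry can also be inspected from the console. This sketch relies on the GeomPoint object exported by ggplot2; it is mostly meant for package developers, so treat it as a convenience rather than the standard route.

```r
library(ggplot2)

GeomPoint$required_aes         # "x" "y": aesthetics that must be supplied
names(GeomPoint$default_aes)   # optional aesthetics that have a default value
point_aes <- GeomPoint$aesthetics()   # everything geom_point understands
point_aes
```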

ggplot2 - Aesthetic Mappings

  • Now, let’s create our second plot:

    ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy, color = class))

  • How does the relationship between hwy and displ change with different vehicle types?

ggplot2 - Aesthetic Mappings

  • We can also map a variable (here drv, the type of drive train) to other characteristics of a point, such as its size:

    ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy, size = drv))
    Warning: Using size for a discrete variable is not advised.

  • or its shape:

    ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy, shape = drv))

ggplot2 - Aesthetic Mappings

  • You can also set a characteristic for all points in any GEOM_FUNCTION without mapping it to a variable: pass it as an argument outside aes().

    ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy), col = "blue")
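
A common slip is to put the constant inside aes(). The sketch below compares the two versions; in the second one "blue" is treated as a data value with a single level, so the points get the first colour of the default palette and a legend appears.

```r
library(ggplot2)

# Setting: the colour is a fixed property of the layer
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), colour = "blue")

# Mapping: "blue" becomes a constant variable that is passed through a scale
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = "blue"))

unique(layer_data(p_set)$colour)      # "blue"
unique(layer_data(p_mapped)$colour)   # a palette colour, not blue
```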

ggplot2 - GEOM_FUNCTIONS

  • There are many other GEOM_FUNCTIONS besides geom_point that are used to create different types of plots (e.g., histograms, bar charts, line charts, error bars…).
  • You can find the full list at the following link: ggplot2 Reference.

ggplot2 - GEOM_FUNCTIONS

  • All GEOM_FUNCTIONS require one or more aesthetics, but not all of them work with the same aesthetics.
  • For instance, it doesn’t make sense to talk about shape in a bar chart.
  • In the next example, we use geom_smooth to draw a sort of trend line, which is a different type of geometry than geom_point.

ggplot2 - GEOM_FUNCTIONS

  ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
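
Going back to the remark that shape makes no sense in a bar chart: when a layer receives an aesthetic it does not understand, ggplot2 warns and ignores it. The sketch below captures that warning in a variable (msg is just an illustrative name) instead of printing it.

```r
library(ggplot2)

msg <- tryCatch(
  {
    ggplot(data = mpg) +
      geom_bar(mapping = aes(x = drv, shape = drv))   # shape is not a bar aesthetic
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)
msg
```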

ggplot2 - GEOM_FUNCTIONS

  • Let’s try to modify the linetype:

    ggplot(data = mpg) +
      geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
    `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

    This will give us a separate trend line for each value of drv.

  • Let’s also modify the color of the curves:

    ggplot(data = mpg) +
      geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, colour = drv))
    `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
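
The message printed after each call reports the defaults that geom_smooth picked. A possible way to make the choice explicit (and silence the message) is to pass method and formula yourself; the linear fit in the second plot is only an example of an alternative smoother.

```r
library(ggplot2)

# Same curves as above, but with the smoother stated explicitly
p_loess <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, colour = drv),
              method = "loess", formula = y ~ x)

# A straight regression line for each value of drv instead of a loess curve
p_lm <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, colour = drv),
              method = "lm", formula = y ~ x)

# Number of fitted lines: one per value of drv
n_lines <- length(unique(layer_data(p_lm)$group))
n_lines
```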

ggplot2 - GEOM_FUNCTIONS

  • One of the best features of ggplot2 is the ability to easily represent two or more geometries on the same plot:

ggplot2 - GEOM_FUNCTIONS

  ggplot(data = mpg) +
    geom_smooth(mapping = aes(x = displ, y = hwy)) +
    geom_point(mapping = aes(x = displ, y = hwy))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot2 - GEOM_FUNCTIONS

  • In this case, however, we see that the code is unnecessarily repetitive.
  • We can avoid this problem by specifying the common aesthetics inside the ggplot function and the unique aesthetics inside the GEOM_FUNCTION:

ggplot2 - GEOM_FUNCTIONS

  ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point(mapping = aes(col = class)) +
    geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
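
Where an aesthetic is declared changes which layers see it. In this sketch the colour mapping is first local to geom_point and then global; a linear smoother is used instead of loess only to keep the fit fast and avoid small-sample warnings.

```r
library(ggplot2)

# col = class is local to the points: a single trend line for all cars
p_local <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(col = class)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# col = class is global: every layer is split by class
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, col = class)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Count the groups seen by the second layer (the smoother)
n_local  <- length(unique(layer_data(p_local, 2)$group))    # 1
n_global <- length(unique(layer_data(p_global, 2)$group))   # 7, one per class
```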

ggplot2: endless power

  • With a bit of effort and experience, it is possible to create beautiful plots with this package:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, col = class)) +
  geom_point(size = 3, alpha = 0.7) +  # Larger points with transparency
  geom_smooth(se = FALSE, linetype = "dashed", linewidth = 1.2, span = 1.5) +  # Smoother dashed lines (wider span), no confidence band
  scale_color_brewer(palette = "Set1") +  # Qualitative ColorBrewer palette (note: Set1 is not colorblind-safe)
  labs(
    title = "Fuel Efficiency vs Engine Displacement",
    subtitle = "Relationship between engine size and highway fuel efficiency across car types",
    x = "Engine Displacement (liters)",
    y = "Highway Fuel Efficiency (mpg)",
    color = "Vehicle Class"
  ) + 
  theme_minimal(base_size = 15) +  # Clean minimalistic theme
  theme(
    plot.title = element_text(face = "bold", size = 18, hjust = 0.5),
    plot.subtitle = element_text(size = 14, hjust = 0.5),
    legend.position = "bottom",  # Move legend to the bottom
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    legend.background = element_rect(fill = "gray95", color = NA)
  )

ggplot2 in production

  • The British public service broadcaster BBC uses ggplot2 to create its data visualizations.
  • Take a look at bbplot.
  • An interesting video about data journalism and ggplot2, from a conference I attended, is available here.

ggplot2 - esquisse

  • There are add-ins for RStudio that help users during the data import and exploration phase. One of these is esquisse, available at esquisse GitHub.
  • To install the package associated with the add-in, use the command install.packages("esquisse"). Launch it by clicking on ‘ggplot2’ builder in the Addins drop-down menu, available under the menu bar.

Combining different ggplots

  • The lifesaver in this case is the patchwork package.
  • Quoting: it makes it ridiculously simple to combine separate ggplots into the same graphic.

Patchwork

gg_bar <- ggplot(mpg) +
  geom_bar(aes(x=drv)) +
  theme_bw()

gg_bubble <- ggplot(data = mpg) +
  geom_point(mapping = aes(
    x = displ,
    y = hwy,
    size = cyl,
    color = cyl,
    alpha = cyl
  )) +
  scale_color_viridis_c(option = "C") +
  theme_bw()

Patchwork

library(patchwork)
gg_bar +
gg_bubble

Patchwork

library(patchwork)
gg_bar /
gg_bubble

Patchwork

library(patchwork)
(gg_bar + gg_bar)/
gg_bubble
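
The operators above can be combined with plot_layout() and plot_annotation(), both from patchwork, to control the relative size of the rows and to add a common title and panel tags. The sketch below rebuilds simplified versions of gg_bar and gg_bubble so that it runs on its own; the title text and the heights are arbitrary choices.

```r
library(ggplot2)
library(patchwork)

gg_bar <- ggplot(mpg) +
  geom_bar(aes(x = drv)) +
  theme_bw()

gg_bubble <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, size = cyl)) +
  theme_bw()

# Two bar charts on top, the bubble chart below and twice as tall
combined <- (gg_bar + gg_bar) / gg_bubble +
  plot_layout(heights = c(1, 2)) +
  plot_annotation(title = "mpg at a glance", tag_levels = "A")
```

Typing combined at the console draws the composite figure.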

Exercise: reproduce the plots